1 AIM

The following tutorials teach how to run NCBI [legacy-blast and] BLAST+ from the command line along with useful AWK, Bash and Perl idioms and 1-liner programs to effectively parse results.

2 ASSUMPTIONS

This tutorial expects that the user has a Unix/Linux environment with the standard bash shell installed, as well as the legacy- and blast+ software suites, which are freely available for download from the NCBI ftp server

For further instruction on the installation and usage of the BLAST+ suite, read NCBI’s BLAST Book and NCBI’s The Statistics of Sequence Similarity Scores tutorial.

3 AUTHOR

These tutorials were designed by Pablo Vinuesa, CCG-UNAM, Cuernavaca, México (X - @pvinmex), for the International Workshops on Bioinformatics, TIB2022, and his courses taught at the Bachelor’s Program in Genome Science, LCG-UNAM and Workshop on Genome Sciences, Facultad de Ciencias UNAM.

4 SOURCE REPOSITORY

This material is distributed from the following TIB-filoinfo GitHub repository

5 LICENSE

Creative Commons Licence
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0

6 THE BLAST SUITE OF PROGRAMS - OVERVIEW OF PROGRAM SELECTION

Remember that to run a BLAST SEARCH we need to select the proper blast program, based on the nature of our query sequence(s) and that of the database to be searched for homologues.

query database program note
DNA DNA blastn -
PROT PROT blastp -
DNA PROT blastx The DNA query is translated in the 6 frames and the products searched against protein DB
PROT DNA tblastn The DNA DB (genome) is translated in the 6 frames and queried with protein sequences
DNA DNA tblastx The DNA query and DNA DB are both translated in the 6 frames and searches performed at the protein level

Figure 1. The four standard BLAST programs.

7 Karlin-Altschul Statistics: the blast E-value for an HSP

In 1990, Samuel Karlin and Stephen Altschul published a paper describing the statistics underlying local alignment scores, based on the hit expectancy value or E-value.

\[ E = kmne^{-{\lambda}S} \] This equation states that the number of alignments expected by chance (\(E\)) during a blast search for a given high-scoring pair (HSP) with a raw score (\(S\)), is a function of the size of the search space (effective query sequence length \(m\) * effective length of the database \(n\) (in number of residues)), the normalized score of the hit (\({\lambda}*S\)), and a minor constant \(k\).

The following formula provides the relationship between both values:

\[P = 1 -e^{-E}\]

Note that for values \(<0.001\), the E-value and the P-value for a given HSP are essentially identical.


8 PRELIMINARIES: (running on buluc|tepeu)

The following code will setup our working directories for the tutorials.

# i) download or copy the sequence data to your working directories for this practice
cd && [ ! -d sesion2_blast ] && mkdir sesion2_blast && cd sesion2_blast

# ii) Copy the data from my home directory
cp /home/vinuesa/cursos/intro_biocomp/sesion2_blast/16S_4blastN.tgz .
cp /home/vinuesa/cursos/intro_biocomp/sesion2_blast/gene_discovery_and_annotation_using_blastx.tgz .

# or use wget to fetch the directly from the command line
wget -c https://github.com/vinuesa/TIB-filoinfo/raw/master/docs/sesion3_BLAST/data/16S_4blastN.tgz
wget -c https://github.com/vinuesa/TIB-filoinfo/raw/master/docs/sesion3_BLAST/data/gene_discovery_and_annotation_using_blastx.tgz
wget -c https://github.com/vinuesa/TIB-filoinfo/raw/master/docs/sesion3_BLAST/data/UniProt_proteomes.tgz

# iii) Generate two subdirectories to hold the data and analysis results
mkdir BLAST_DB_AA BLAST_DB_NT UniProt_proteomes

# iv) Save the path to our parental working directory in a variable
wkdir=$(pwd) 
echo $wkdir

# v) Move each *tgz file to the proper dir
mv 16S_4blastN.tgz BLAST_DB_NT
mv UniProt_proteomes.tgz UniProt_proteomes
mv gene_discovery_and_annotation_using_blastx.tgz BLAST_DB_AA

Now we are set to start the hands-on tutorials on running BLAST from the command line

9 Searching a DNA database with DNA queries (BLASTN)

9.1 EXERCISE 1 - classify novel 16S rDNA sequences using blastn

The aim of the exercise is to learn the basics of a BLAST analysis. We will generate an indexed BLAST database of nucleotide sequences and interrogate it using blastn.

We will use the following data for the exercise.

  • Query sequences: 16S_problema.fna, which correspond to partial 16S sequences generated in our lab from environmental microbes recovered contaminated soils and rivers in Morelos, Mexico. Part of these sequences were reported in Ocha-Sánchez & Vinuesa, 2017.
  • Reference sequences to build a blastn-searchable DB: 16S_seqs4_blastDB.fna. This is a tiny portion of the bacterial type strain 16S rRNA sequences fetched from RDPII.

The objective of the exercise is to classify the query sequences in 16S_problema.fna at the genus level based on BLAST statistics.

9.1.1 Exploring the query and reference sequences

Move into the directory holding the nucleotide sequences and explore both files using standard shell filtering tools and modify the FASTA headers in 16S_seqs4_blastDB.fna to make them suitable for formatdb|makeblastdb indexing

# We need to untar & uncompress (gunzip) the compressed tar file
cd BLAST_DB_NT/ && tar -xvzf 16S_4blastN.tgz

# 1) explore the fasta headers: 
grep '>' 16S_seqs4_blastDB.fna |less

# 1.1) How many genera and species per genus are found in the source file 16S_seqs4_blastDB.fna?
# i. the genera
grep '>' 16S_seqs4_blastDB.fna | cut -d_ -f1 | sort | uniq -c | sort -nrk1

# ii. the species
grep '>' 16S_seqs4_blastDB.fna | cut -d_ -f1,2 | sort | uniq -c | sort -nrk1

9.1.2 Using Perl 1-liners to generate FASTA headers suitable for database indexing with makeblastdb

# 1.2) Are the sequences in a suitable format for makeblastdb indexing?
#      What command would you use to inspect the sequences?

# 1.2.1)  generate a proper FASTA header for db indexing
perl -pe 'if( /^>/ ){ $c++; s/>/>gnl|16S_DB|$c / }' 16S_seqs4_blastDB.fna | grep '^>' 
perl -pe 'if( /^>/ ){ $c++; s/>/>gnl|16S_DB|$c / }' 16S_seqs4_blastDB.fna > 16S_seqs4_blastDB.fnaed

9.1.3 Running [formatdb]|makeblastdb to generate an indexed blast database

Note: the commented command lines in the following blocks provide the code to run legacy blast. The exercises are demonstrated using the blast+ suite of programs

#formatdb -i 16S_seqs4_blastDB.fnaed -p F -o T

makeblastdb -h
makeblastdb -help

makeblastdb -in 16S_seqs4_blastDB.fnaed -dbtype nucl -parse_seqids

9.1.4 Identify the single best-hit for each query sequence

# 2.1) run blastn (blastall -p blastn), and get only the best hit (-b 1) in tabular format (-m 8).
# blastall -p blastn -i 16S_problema.fna -b 1 -a 6 -d 16S_seqs4_blastDB.fnaed -m 8 > 16S_problema_blastN_b1_m8.tab
blastn -query 16S_problema.fna -db 16S_seqs4_blastDB.fnaed -outfmt 6 -num_alignments 1 > 16S_problema_blastN_maln1_m6.tab

# 2.2) explore the table structure
head 16S_problema_blastN_maln1_m6.tab
13  16S_DB:480  99.485  1359    5   2   1   1359    48  1404    0.0 2475
18  16S_DB:494  99.033  1344    10  3   1   1344    49  1389    0.0 2416
2   16S_DB:549  99.552  1340    5   1   4   1343    81  1419    0.0 2440
27  16S_DB:546  100.000 1342    0   0   1   1342    78  1419    0.0 2479
36  16S_DB:480  99.705  1358    3   1   1   1358    48  1404    0.0 2490
4   16S_DB:600  99.335  1354    4   4   1   1353    79  1428    0.0 2446
48  16S_DB:484  98.375  1354    20  1   2   1355    48  1399    0.0 2390
53  16S_DB:548  99.925  1342    1   0   1   1342    78  1419    0.0 2473
62  16S_DB:496  99.106  1342    8   4   1   1342    85  1422    0.0 2409
7   16S_DB:600  98.736  1345    14  3   1   1343    79  1422    0.0 2386

As can be seen in the output above, the -outfmt 6 tabular format has no header.

9.1.4.1 The twelve fields of a standad tabular blast output (-outfmt 6|[-m 8])

The default -outfmt 6 blast output contains the following 12 fields

# The default m6 blast output fields
#=====================================================================================
# >>> The default -outfmt 6 | -m 8 blast output contains the following 12 fields   <<<     
# qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
#-------------------------------------------------------------------------------------
# 1.     qseqid      query (e.g., gene) sequence id
# 2.     sseqid      subject (e.g., reference genome) sequence id
# 3.     pident      percentage of identical matches
# 4.     length      alignment length
# 5.     mismatch    number of mismatches
# 6.     gapopen     number of gap openings
# 7.     qstart      start of alignment in query
# 8.     qend        end of alignment in query
# 9.     sstart      start of alignment in subject
#10.     send        end of alignment in subject
#11.     evalue      expect value
#12.     bitscore    bit score
#-------------------------------------------------------------------------------------

9.1.5 Adding a header to the standard tabular output: -outfmt 6 [-m 8 in legacy blast]

# 2.3 Add a header to the standard output; print to file and append to 16S_problema_blastN_maln1_m6.tab
echo -e "qseqid\tsseqid\tpident\tlength\tmismatch\tgapopen\tqstart\tqend\tsstart\tsend\tevalue\tbitscore" > standard_blast_fields.tsv
cat standard_blast_fields.tsv 16S_problema_blastN_maln1_m6.tab > t && mv t 16S_problema_blastN_maln1_m6.tab

head -20 16S_problema_blastN_maln1_m6.tab | column -t

# 2.4) how many unique hits are there?
cut -f 2 16S_problema_blastN_maln1_m6.tab | sed '1d' | sort | uniq -c

9.1.6 Retrieve all the hits from the database using \(blastdbcmd\)

#   Generate the list of hit IDs for each query
#      NOTE: the hits are in the same order as the query sequences
cut -f2 16S_problema_blastN_maln1_m6.tab | sed '1d' > IDs4blastdbcmd.txt

#blastdbcmd -i IDs4blastdbcmd.txt -d 16S_seqs4_blastDB.fnaed | grep '>' |cut -d\| -f3 | cut -d' ' -f2 > hit_sequences_b1.fna
blastdbcmd -entry_batch IDs4blastdbcmd.txt -db 16S_seqs4_blastDB.fnaed > best-hit_sequences.fna

9.1.7 Generate a table containing the query sequence IDs, the subject headers and basic alignment statistics using Linux tools

# 1) Generate a list of the hits from the FASTA headers
grep '>' best-hit_sequences.fna | cut -d' ' -f2 > hit_sequence_headers.txt

# 2) Classify our query sequences, based on the subject headers of each best hit, using a simple tabular output format
cut -f1 16S_problema_blastN_maln1_m6.tab | sed '1d' > problem_seqs.txt
paste problem_seqs.txt hit_sequence_headers.txt > classified_16S_problema_sequences.tab

head -5 classified_16S_problema_sequences.tab
13  Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928
18  Mycobacterium_smegmatis_T|ATCC_19420|AJ131761
2   Mycobacterium_peregrinum_T|CIP_105382T|AY457069
27  Mycobacterium_fortuitum_T|CIP_104534T|AY457066
36  Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928

9.1.8 Generate a table containing the query sequence IDs, the subject headers and basic alignment statistics using process substitution to avoid writing intermediate files

A more efficient and idiomatic (“Bashish”) code to achieve the same result, without having to write intermediate files, is the use of “process substitution”, which has this general syntax: <(any comamnd(s)) or pipeline here). Process substitution replaces the command with its output, treating it as if it were stored in a file. This output can therefore be used by commands like \(cat\), \(paste\) or any other that works on files. Lets see it in action:

paste <(cut -f1 16S_problema_blastN_maln1_m6.tab | sed '1d') <(grep '>' best-hit_sequences.fna | cut -d' ' -f2) | head | column -t
13  Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928
18  Mycobacterium_smegmatis_T|ATCC_19420|AJ131761
2   Mycobacterium_peregrinum_T|CIP_105382T|AY457069
27  Mycobacterium_fortuitum_T|CIP_104534T|AY457066
36  Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928
4   Mycobacterium_paraffinicum_T|ATCC_12670|GQ153270
48  Mycobacterium_gordonae_T|ATCC_14470.|X52923
53  Mycobacterium_neworleansense_T|ATCC_49404|AY457068
62  Mycobacterium_chitae_T|ATCC_19627|X55603
7   Mycobacterium_paraffinicum_T|ATCC_12670|GQ153270
... TRUNCATED

9.1.9 Add basic BLAST statistic information to the summary table using Bash’s “process substitution” idiom

The previous result looks great, but misses key statistical information about the robustness of the hits. Lets use the “process substitution” trick that we have just learned in the previous example to add some extra fields from the blast output table (pident, length, e-value & bit score) to our summary table, including a header.

# 3) add also columns 3, 4, 11, 12 (pident, length,e-value & bit score) to the tabular output, including a header
cat <(echo -e "sid\tpid\tlen\tEval\tscore\ttax") <( paste <(cut -f1,3,4,11,12 16S_problema_blastN_maln1_m6.tab | sed '1d') <(grep '>' best-hit_sequences.fna | cut -d' ' -f2) | head) | column -t
sid  pid      len   Eval  score  tax
13   99.485   1359  0.0   2475   Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928
18   99.033   1344  0.0   2416   Mycobacterium_smegmatis_T|ATCC_19420|AJ131761
2    99.552   1340  0.0   2440   Mycobacterium_peregrinum_T|CIP_105382T|AY457069
27   100.000  1342  0.0   2479   Mycobacterium_fortuitum_T|CIP_104534T|AY457066
36   99.705   1358  0.0   2490   Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928
4    99.335   1354  0.0   2446   Mycobacterium_paraffinicum_T|ATCC_12670|GQ153270
48   98.375   1354  0.0   2390   Mycobacterium_gordonae_T|ATCC_14470.|X52923
53   99.925   1342  0.0   2473   Mycobacterium_neworleansense_T|ATCC_49404|AY457068
62   99.106   1342  0.0   2409   Mycobacterium_chitae_T|ATCC_19627|X55603
7    98.736   1345  0.0   2386   Mycobacterium_paraffinicum_T|ATCC_12670|GQ153270
... TRUNCATED

9.1.10 Filter out high-confidence classifications

It is important to realize that finding a hit in the database does not necessarily mean that it supports a reliable identification or classification of the query sequence. In order to have a reasonably high confidence in the classification of 16S rDNA sequences at the genus level, based on a best \(blastn\) hit, it should at least satisfy the following conditions:

  • alignment lenght >= 500 bp
  • pident >= 95%

9.1.10.1 AWK is a very handy and efficient tool for filtering BLAST output tables

Lets first explore the BLAST table to identify hits that do not satisfy the two criteria defined above (pident >= 95% && length >= 500). \(AWK\) is ideal for this task.

9.1.10.1.1 identify hits with < 95% identity OR short alignments (length < 500) using \(AWK\) patterns
# we could use any of the following lines to print the results, including a header
# cat <(cat standard_blast_fields.tsv) <(awk 'BEGIN{FS=OFS="\t"}$3 < 95 || $4 < 500' 16S_problema_blastN_maln1_m6.tab) | column -t
awk 'BEGIN{FS=OFS="\t"; print "qseqid\tsseqid\tpident\tlength\tmismatch\tgapopen\tqstart\tqend\tsstart\tsend\tevalue\tbitscore"}$3 < 95 || $4 < 500' 16S_problema_blastN_maln1_m6.tab | column -t
qseqid                            sseqid      pident   length  mismatch  gapopen  qstart  qend  sstart  send  evalue    bitscore
1367SAA005_14Cip-FD1.b1.ab1_638   16S_DB:86   93.218   634     39        4        1       632   37      668   0.0       929
1367SAA005_A17Cip-FD1.b1.ab1_171  16S_DB:86   95.930   172     6         1        1       171   35      206   2.51e-76  278
1367SAA005_A6BSm-FD1.b1.ab1_477   16S_DB:86   98.745   478     5         1        1       477   37      514   0.0       848
1367SAA005_S13-FD1.b1.ab1_121     16S_DB:534  88.333   120     11        3        3       121   64      181   2.33e-35  141
1367SAA005_S2CbTm-FD1.b1.ab1_84   16S_DB:269  97.647   85      1         1        1       84    47      131   1.15e-36  145
1367SAA005_S4CbTm-FD1.b1.ab1_172  16S_DB:269  100.000  172     0         0        1       172   52      223   1.49e-88  318
1367SAA005_S6Sm-FD1.b1.ab1_347    16S_DB:184  99.143   350     0         3        1       347   40      389   0.0       627
1367SAA005_S9b-FD1.b1.ab1_142     16S_DB:3    97.887   142     3         0        1       142   40      181   1.60e-67  248
1367SAA005_Tm203-FD1.b1.ab1_491   16S_DB:128  97.942   486     9         1        3       487   52      537   0.0       841
1367SAA005_VNP1-FD1.b1.ab1_578    16S_DB:435  94.965   576     28        1        3       578   31      605   0.0       902
9.1.10.1.2 generate the final (filtered) results table with highly confident classifications

In order to get a confident blastn-based classification of our problem sequences at the genus level, we should use only hits with pident >= 95% and a minimal alignment length >= 500 bp.

paste <(awk 'BEGIN{FS=OFS="\t"}NR > 1{print $1,$3,$4,$11,$12}' 16S_problema_blastN_maln1_m6.tab) <(grep '>' best-hit_sequences.fna | cut -d' ' -f2) | awk 'BEGIN{FS=OFS="\t"; print "sid\tpid\tlen\tEval\tscore\tbest_hit"}$2 >= 95 && $3 >= 500' > 16S_problema_blastN_best-hit_classification_pidentGE95_alnGE500.tsv

Display the table’s heat and tail 16S_problema_blastN_best-hit_classification_pidentGE95_alnGE500.tsv

cat <(head 16S_problema_blastN_best-hit_classification_pidentGE95_alnGE500.tsv) <(tail 16S_problema_blastN_best-hit_classification_pidentGE95_alnGE500.tsv) | column -t
sid                                   pid      len   Eval  score  best_hit
13                                    99.485   1359  0.0   2475   Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928
18                                    99.033   1344  0.0   2416   Mycobacterium_smegmatis_T|ATCC_19420|AJ131761
2                                     99.552   1340  0.0   2440   Mycobacterium_peregrinum_T|CIP_105382T|AY457069
27                                    100.000  1342  0.0   2479   Mycobacterium_fortuitum_T|CIP_104534T|AY457066
36                                    99.705   1358  0.0   2490   Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928
4                                     99.335   1354  0.0   2446   Mycobacterium_paraffinicum_T|ATCC_12670|GQ153270
48                                    98.375   1354  0.0   2390   Mycobacterium_gordonae_T|ATCC_14470.|X52923
53                                    99.925   1342  0.0   2473   Mycobacterium_neworleansense_T|ATCC_49404|AY457068
62                                    99.106   1342  0.0   2409   Mycobacterium_chitae_T|ATCC_19627|X55603
1367SAA005_S11Sm-FD1.b1.ab1_1049      98.763   1051  0.0   1866   Pseudomonas_monteilii_T|CIP_104883|AF064458
1367SAA005_S12-FD1.b1.ab1_899         98.210   894   0.0   1559   Microbacterium_testaceum_T|DSM_20166|X77445
1367SAA005_S15-FD1.b1.ab1_1111        98.115   1114  0.0   1936   Microbacterium_testaceum_T|DSM_20166|X77445
1367SAA005_S1CbTm-FD1.b1.ab1_683      99.227   647   0.0   1166   Pseudomonas_monteilii_T|CIP_104883|AF064458
1367SAA005_S1-FD1.b1.ab1_542          99.815   542   0.0   996    Pseudomonas_plecoglossicida_T|FPC951|AB009457
1367SAA005_S7CbTm-FD1.b1.ab1_961      98.954   956   0.0   1709   Citrobacter_freundii_T|DSM_30039|AJ233408
1367SAA005_S9-FD1.b1.ab1_1091         99.451   1092  0.0   1984   Cellulosimicrobium_funkei_T|W6122|AY501364
1367SAA005_VNC15-FD1.b1.ab1_1062      99.052   1055  0.0   1890   Rhizobium_alamii_T|type_strain:GBV016|AM931436
1367SAA005_VRP15_bis-FD1.b1.ab1_1124  99.822   1125  0.0   2065   Ensifer_terangae_T|type_strain:_LMG_7834|X68388
1367SAA005_VRP15-FD1.b1.ab1_1112      99.012   1113  0.0   1993   Ensifer_terangae_T|type_strain:_LMG_7834|X68388

The previous examples illustrated how to combine a set of \(Shell\) and \(Linux\) tools for efficient processing of BLAST output tables. With a little practice, you will soon master these power tools.

9.1.10.1.3 Generate summaries of the results

Finally, let’s generate some simple summary statistics of the classified sequences.

# 6 How many sequences could be classified with high confidence?
sed '1d' 16S_problema_blastN_best-hit_classification_pidentGE95_alnGE500.tsv | wc -l # 42 
42
# 7 How many sequences per genus could be classified with high confidence?
sed '1d' 16S_problema_blastN_best-hit_classification_pidentGE95_alnGE500.tsv | cut -f6 | sed 's/_.*$//' | sort | uniq -c | sort -nrk1
     14 Mycobacterium
      5 Rhizobium
      5 Ensifer
      4 Pseudomonas
      2 Microbacterium
      2 Escherichia/Shigella
      2 Beijerinckia
      1 Sphingomonas
      1 Sinorhizobium
      1 Shinella
      1 Isoptericola
      1 Citrobacter
      1 Cellulosimicrobium
      1 Brevundimonas
      1 Acinetobacter

9.1.11 HOMEWORK 1

Repeat the blastn-based classification exercise based on 16S rDNA sequences (section 7.1.4 and following sub-sections) using 10 hits per query. Impose the same filters applied in the previous exercise.

  • Are the results of both \(blastn\) runs (single best-hit vs. 10 hits) consistent?
  • How many sequences from the problem dataset remain unclassified, and what is the reason they were not classified?
  • Are there sequences that remain unclassified that have a pident > 90% && < 95%?
    • If so, how many?
    • Based on the evidence gathered from the 5 hits/query, how would you classify these sequences?

10 BLAST+ with protein queries against a protein database (BLASTP)

BLASTP is a key program in genomics and phylogenomics, as it is one of the main tools used to:

10.1 Prepare working directory

  • extract the sequence data from the compressed tar file
# Move into the UniProt_proteomes dir and extract the sequences in the tgz file
cd $wkdir && cd UniProt_proteomes 
tar -xvzf UniProt_proteomes.tgz

10.2 EXERCISE 2: identification of paralogs in large multigene families of eukaryotes - identifying putative small Rab GTPases in the proteome of Acanthamoeba castellanii strain Neff

The family of small Rab GTPases belongs to the Ras GTPase superfamily, which comprise over 150 human members, with evolutionarily conserved orthologs found throughout Eukaryota. The superfamily is divided into five major branches on the basis of sequence and functional similarities: Ras, Rho, Rab, Ran and Arf.

Small GTPases share a common biochemical mechanism and act as binary molecular switches, being active when GTP is bound and inactive when GDP is bound to the protein. Ras family proteins function as monomeric G proteins. Variations in structure and post-translational modifications that dictate specific subcellular locations and the proteins that serve as their regulators and effectors allow these small GTPases to function as sophisticated modulators of a remarkably complex and diverse range of cellular processes.

Rab GTPases are critical in membrane trafficking and insert into specific membrane compartments via their C-terminal geranylgeralyl lipidic tails, serving as organelle identity posts. In their GTP-bound (active) form, Rabs recruit multiple and specific effector proteins, which interact with membrane tethering (e.g. HOPS) and fusion factors (SNAREs), thereby serving a critical role in controlling the specific docking and fusion of membrane-bound compartments along the endocytic, recycling, exocytic, phagocytic and autophagic pathways.

In our research group at CCG-UNAM we use evolutionary, comparative, and functional genomics approaches to study Acanthamoeba castellanii and Stenotrophomonas spp. to identify essential loci and the molecular mechanisms that underlie the origins of complex phenotypes such as virulence, focusing on the cellular microbiology of Acanthamoeba-Stenotrophomonas interactions. We have recently found that Stenotrophomonas maltophilia, a globally emerging multidrug-resistant opportunistic pathogen is able to use Acanthamoeba trophozoites as an alternative replication niche.

Using a combination of bioinformatics, molecular and cell-biological approaches, we discrovered that Stenotrophomonas is resistant to digestion by A. castellanii, a free-living amoeba that we use in our lab as a proffesional phagocye model (Rivera et al. 2023, accepted). More specifically, we found that:

  1. Many Rab proteins in the A. castellanii reference (RefSeq) proteome are misannotated, including errors in their predicted gene structures

  1. Stenotrophomonas maltophilia bacteria replicate withing acidified, Rab7A-positive vacuoles (phagosomes) of Acanthamoeba castellanii trophozoites.

In the following exercise, we will use \(blastp\) to study the large family of Rab GTPases encoded by the Acanthamoeba castellanii genome, using the well-annotated and canonical RAB7A_HUMAN protein from as the query.

10.3 Format the Acanthamoeba castellanii reference proteome fetched from UNIPROT

UniProt is the world’s leading high-quality, comprehensive and freely accessible resource of protein sequence and functional information. You should explore this extraordinary resource and read the following paper from the UniProt Consortium - UniProt: the Universal Protein Knowledgebase in 2023.

makeblastdb -in ACACA.faa -dbtype prot -parse_seqids
ACACA.faa
ACACA.faa.pdb
ACACA.faa.phr
ACACA.faa.pin
ACACA.faa.pog
ACACA.faa.pos
ACACA.faa.pot
ACACA.faa.psq
ACACA.faa.ptf
ACACA.faa.pto
DICDI.faa

10.4 Deep \(BLASTP\) search to find A. castellanii Ras-superfamily members, including Rabs, using RAB7A_HUMAN as the query

The aim of this exercise is to identify A. castellanii Ras-superfamily members, using RAB7A_HUMAN as the query. Given the huge evolutionary distance separating humans from Amoebozoa (probably > 1.2 billion yrs), we are going to use the BLOSUM45 matrix.

Fig. 1 from Brocks et. al 2023. Nature 2023 Jun;618(7966):767-773. doi: 10.1038/s41586-023-06170-w.

Furthermore, since the Ras superfamily is expected to contain many members, possibly several hundreds, we are going to request 300 hits, using custom output fields in tabular format.

blastp -matrix BLOSUM45 -query RAB7A_HUMAN.faa -db ACACA.faa -outfmt '6 qseqid qlen sseqid slen stitle length pident gaps evalue bitscore qcovs' -num_alignments 300 > RAB7A_HUMAN_vs_ACACA_300hits.out
  • lets find out how many hits we got
wc -l RAB7A_HUMAN_vs_ACACA_300hits.out
244

10.4.1 Explore the distribution of key alignment statistics using the col_sumStats.sh script

There are many (244) rows to explore. It is convenient to get simple summary statistics for key alignment parameters, such as: - E-value (column 9) - bit_score (column 10) - qcovs (column 11) - slen (column 4) - pident (column 7)

  1. Copy the following code to a file named col_sumStats.sh and move it to your $HOME/bin directory
#!/usr/bin/env bash
# AUTHOR: Pablo Vinuesa. CCG-UNAM. @pvinmex
# little awk script to compute basic summary statistics
#  receives the output of cut for a numeric column in a table 
sort -n | awk '
BEGIN{ max = min = Mode = 'NaN' }
{     
     sum+=$1
     a[x++]=$1
     b[$1]++
     if(b[$1]>hf){hf=b[$1]}
}
NR == 1 { min=$1; max=$1 }
$1 < min { min=  $1 }
$1 > max { max=  $1 }

END{ n = asort(a);idx=int((x+1)/2)
    print "Min: " min
    print "Mean: " sum/x 
    print "Median: " ((idx==(x+1)/2) ? a[idx] : (a[idx]+a[idx+1])/2) 
    for (i in b){if(b[i]==hf){(k=="") ? (k=i):(k=k FS i)}{FS=","}}
    print "Mode: " k
    print "Max: " max
}
'

Alternatively, you may fetch it with the following command from my GitHub repo

cd ~/bin

wget -c https://raw.githubusercontent.com/vinuesa/TIB-filoinfo/master/col_sumStats.sh

# takes you back again to your working directory
cd -
  1. Now run the following lines of code:
echo "# E-value"
cut -f 9 RAB7A_HUMAN_vs_ACACA_300hits.out | col_sumStats.sh
echo
echo "# bit score"
cut -f 10 RAB7A_HUMAN_vs_ACACA_300hits.out | col_sumStats.sh
echo
echo "# qcovs"
cut -f 11 RAB7A_HUMAN_vs_ACACA_300hits.out | col_sumStats.sh
echo
echo "# pident"
cut -f 7 RAB7A_HUMAN_vs_ACACA_300hits.out | col_sumStats.sh
echo
echo "# slen"
cut -f 4 RAB7A_HUMAN_vs_ACACA_300hits.out | col_sumStats.sh
# E-value
Min: 6.25e-103
Mean: 0.250727
Median: 9.805e-15
Mode: 0.001
Max: 6.9

# bit score
Min: 26.3
Mean: 72.9398
Median: 68.85
Mode: 27.5
Max: 292

# qcovs
Min: 9
Mean: 68.6885
Median: 77
Mode: 78
Max: 100

# pident
Min: 20.000
Mean: 31.7386
Median: 30.5675
Mode: 31.452
Max: 71.635

# slen
Min: 94
Mean: 440.131
Median: 217.5
Mode: 193,195
Max: 2812

From the output above, we can conclude that there are many hits with very poor E-values (>= 0.001) and scores (< 50). There is also great variation in the subject sequence lenght (slen) and query coverage (qcovs).

10.4.2 Explore the alignment length distribution using a simple tabular plus graphical representation on the terminal

awk -F"\t" '{l=$4; a[l]++; if (a[l]>max) max=a[l]} END {printf("Length\tFrequency"); for (i in a) if(a[i] >= 2){printf("\n%s\t%s\t",i,a[i]); for (j=0;j<(int(a[i]*(100/max)));j++) printf("*")} print ""}' RAB7A_HUMAN_vs_ACACA_300hits.out
Length  Frequency
178 2   *********************************
179 3   **************************************************
180 3   **************************************************
181 4   ******************************************************************
184 5   ***********************************************************************************
186 5   ***********************************************************************************
188 3   **************************************************
190 4   ******************************************************************
193 6   ****************************************************************************************************
194 2   *********************************
195 6   ****************************************************************************************************
196 3   **************************************************
197 3   **************************************************
198 3   **************************************************
199 4   ******************************************************************
200 3   **************************************************
202 2   *********************************
203 3   **************************************************
204 3   **************************************************
205 2   *********************************
206 5   ***********************************************************************************
207 2   *********************************
208 3   **************************************************
209 2   *********************************
210 3   **************************************************
211 3   **************************************************
213 3   **************************************************
216 5   ***********************************************************************************
217 2   *********************************
218 2   *********************************
220 2   *********************************
221 3   **************************************************
223 2   *********************************
228 4   ******************************************************************
244 3   **************************************************
264 2   *********************************
266 2   *********************************
273 2   *********************************
421 2   *********************************
514 2   *********************************
558 2   *********************************
578 2   *********************************
988 2   *********************************

Seems that most homologs fall in the range between 178 and 250.

  • A better histogram, including a density plot, can be easily generated with base \(R\) using the following code:
# save the subject-length column data to a file to be read by R
cut -f4 RAB7A_HUMAN_vs_ACACA_300hits.out > qlen.list

# call R
R

# read data
dfr <- read.csv("qlen.list", header=F)

# plot 
hist(dfr$V1, breaks = "Freedman-Diaconis", xlim = c(0, 3000), probability=TRUE)
abline(v = median(dfr$V1), col = "blue", lwd = 3)
abline(v = mean(dfr$V1), col = "red", lwd = 3)
lines(density(dfr$V1), col = 'green', lwd = 3)

which generates the following graph:

Figure 4. Histogram and density plot of RAB7A_HUMAN \(blastp\) hit lengths in the A. castellanii proteome. The median and mean lengths are highlighted as blue and red vertical bars, respectively.

10.4.3 Filter likely canonical Rab GTPase hits based on key alignment parameters

From the previous exploratory data analysis, we can conclude that the hits for likely canonical small Rab-GTPase family homologues should meet the following (relaxed) attributes:

- slen >= 178 && slen < 270
- E-value < 1e-10
- bit-score > 70
- qcovs > 72
- pident > 28

Filtering the table according to those criteria yields 80 hits:

awk 'BEGIN{FS=OFS="\t"; print "sseqid\tslen\tpident\tgaps\tevalue\tbitscore\tqcovs"} ($4 > 178 && $4 < 270) && $7 >= 28 && $9 < 1e-10 && $10 > 70 && $11 > 72 {print $3,$4,$7,$8,$9,$10,$11}' RAB7A_HUMAN_vs_ACACA_300hits.out | column -t -s $'\t'
sseqid                  slen  pident  gaps  evalue     bitscore  qcovs
tr|L8HHF3|L8HHF3_ACACA  186   71.635  23    6.25e-103  292       100
tr|L8GPH2|L8GPH2_ACACA  205   58.537  2     1.20e-86   251       99
tr|L8GXM9|L8GXM9_ACACA  197   60.736  2     5.09e-64   193       78
tr|L8GRY8|L8GRY8_ACACA  186   47.343  28    1.02e-58   179       98
tr|L8GLL4|L8GLL4_ACACA  206   50.000  4     4.58e-54   168       78
... TRUNCATED
tr|L8GLE5|L8GLE5_ACACA  193   28.302  3     9.76e-18   74.2      77
tr|L8H8M5|L8H8M5_ACACA  228   28.986  27    1.72e-17   74.2      90
tr|L8GIH3|L8GIH3_ACACA  198   29.944  11    3.36e-17   73.0      86
tr|L8GIM0|L8GIM0_ACACA  203   34.177  7     9.07e-17   71.9      75
tr|L8HCN9|L8HCN9_ACACA  194   31.447  32    1.04e-16   71.6      77

To further validate the selected hits as canonical small Rab-GTPases, these should be aligned and subjected to phylogenetic analysis along with reference sequences. We will perform this analysis in later sections of the course.


10.5 EXERCISE 3: Identify HUMAN vs. Acanthamoeba reciprocal best hits (RBHs) in the Rab-GTPase multigene family

In the previous exercise we identified a selection of likely Rab-GTPase homologs in the A. castellanii proteome. Our next aim is to identify reciprocal best-hits or RBHs between a set of human reference Rabs retrieved from UNIPROT, and saved in file HUMAN_63Rabs_Ran.faa , and the A. castellanii proteome. The resulting RBHs are likely orthologs, and therefore we could transfer the annotation of the HUMAN Rabs to their Acanthamoeba RBHs.

10.5.1 Overview of the bioinformatic strategy to find RBHs

The bioinformatic strategy to find RBHs is be implemented in the following steps:

  1. run \(blastp\) against the A. castellanii proteome database using HUMAN_63Rabs_Ran.faa as query using the BLOSUM45 matrix
  2. retrieve the A. castellanii proteome database hits using \(blastdbcmd\)
  3. make a blastdb for HUMAN_63Rabs_Ran.faa and query it with the retrieved A. castellanii proteome database hits
  4. sort the blastp output table from the preceding search by increasing E-values (on column 11)
  5. filter out unique HUMAN vs ACA RBHs using AWK hashes from the sorted blast output table

10.5.1.1 Prepare a working directory for the RBH searches

Whenever possible use symlinks to avoid unnecessary file copying to preserve precious disk space

mkdir RBHs
cd RBHs/
  
# make symlinks whenever possible, to avoid unnecessary file copying to avoid filling the server's disks
ln -s ../ACACA.faa .
ln -s ../ACACA.faa.* .
ln -s ../HUMAN_63Rabs_Ran.faa .

10.5.1.2 Run \(blastp\) against the A. castellanii proteome database using HUMAN_63Rabs_Ran.faa as the query using the BLOSUM45 matrix

Given the huge evolutionary distance between human and amoebae, we shall use the BLOSUM45 matrix to maximize sensitivity.

blastp -query HUMAN_63Rabs_Ran.faa -db ACACA.faa -matrix BLOSUM45 -outfmt 6 -num_alignments 10 > HUMAN_63Rabs_vs_ACACA_10hits_m6.out

10.5.1.3 Retrieve the A. castellanii proteome database hits using \(blastdbcmd\)

blastdbcmd -entry_batch <(cut -f2 HUMAN_63Rabs_vs_ACACA_10hits_m6.out) -db ACACA.faa > ACACA_vs_HUMAN_best_Rab_hits.faa

10.5.1.4 Make a blastdb for HUMAN_63Rabs_Ran.faa and query it with the retrieved A. castellanii proteome database hits

makeblastdb -in HUMAN_63Rabs_Ran.faa -dbtype prot -parse_seqids
blastp -query ACACA_vs_HUMAN_best_Rab_hits.faa -db HUMAN_63Rabs_Ran.faa -matrix BLOSUM45 -outfmt 6 -num_alignments 10 > ACACA_vs_HUMAN_RBHs_m6.out

10.5.1.5 Sort the blastp output table from the preceding search by increasing E-values (on column 11)

# The following lines extract a sorted list of unique Acanthamoeba identifiers 
#   found in the ACA_vs_HUMAN and HUMAN_vs_ACA searches
cat <(cut -f1 ACACA_vs_HUMAN_RBHs_m6.out) <(cut -f2 HUMAN_63Rabs_vs_ACACA_10hits_m6.out) | sort | uniq -c
cat <(cut -f1 ACACA_vs_HUMAN_RBHs_m6.out) <(cut -f2 HUMAN_63Rabs_vs_ACACA_10hits_m6.out) | sort -u

# grep out some of the Acanthamoeba hits from the HUMAN_63Rabs_vs_ACACA_10hits_m6.out 
#   and sort the resulting list by increasin E-value (on col 11)
grep  -E 'L8GF16|L8GYY7|L8H946|L8HHF3' HUMAN_63Rabs_vs_ACACA_10hits_m6.out | sort -gk11,11

# This code greps out the sorted and uniique ACACA hits from the HUMAN_63Rabs_vs_ACACA_10hits_m6.out table
for ACAid in $(cat <(cut -f1 ACACA_vs_HUMAN_RBHs_m6.out) <(cut -f2 HUMAN_63Rabs_vs_ACACA_10hits_m6.out) | sort -u); do grep $ACAid HUMAN_63Rabs_vs_ACACA_10hits_m6.out; done

# Same as previous line, but sorting the blastp output table by increasing E-values (on column 11; sort -gk11,11)
#  output displayed below
for ACAid in $(cat <(cut -f1 ACACA_vs_HUMAN_RBHs_m6.out) <(cut -f2 HUMAN_63Rabs_vs_ACACA_10hits_m6.out) | sort -u); do grep $ACAid HUMAN_63Rabs_vs_ACACA_10hits_m6.out; done | sort -gk11,11 > HUMAN_63Rabs_vs_ACACA_10hits_m6.out.sorted
  • Sorted output table generated by last line of code, by increasing E-values. Note the presence of repeated Acanthamoeba identifiers (2cnd col)
    sp|P62826|RAN_HUMAN L8GF16  82.524  206 36  0   11  216 8   213 7.64e-129   359
    sp|P61106|RAB14_HUMAN   L8H946  73.953  215 54  1   1   215 1   213 2.88e-117   329
    sp|P51149|RAB7A_HUMAN   L8HHF3  71.635  208 36  4   1   207 1   186 6.25e-103   292
    sp|P61019|RAB2A_HUMAN   L8H946  65.241  187 65  0   3   189 6   192 1.19e-90    261
    sp|Q8WUD1|RAB2B_HUMAN   L8H946  65.079  189 66  0   1   189 4   192 3.44e-90    260
    sp|P61018|RAB4B_HUMAN   L8H946  66.857  175 58  0   5   179 6   180 6.71e-89    257
    sp|P20338|RAB4A_HUMAN   L8H946  66.854  178 59  0   7   184 3   180 7.71e-88    254
    sp|Q9NP72|RAB18_HUMAN   L8GYY7  58.163  196 80  2   9   203 15  209 4.10e-80    234
    sp|P51151|RAB9A_HUMAN   L8HHF3  48.241  199 83  3   4   201 5   184 2.34e-62    188
    sp|Q15907|RB11B_HUMAN   L8H946  52.874  174 82  0   8   181 6   179 9.51e-62    188
    sp|P62491|RB11A_HUMAN   L8H946  49.474  190 96  0   8   197 6   195 1.23e-61    187
    sp|Q9NP90|RAB9B_HUMAN   L8HHF3  46.970  198 86  2   4   200 5   184 2.92e-60    183
    ... TRUNCATED
    

10.5.1.6 filter out unique HUMAN vs ACA RBHs using AWK hashes from the sorted blast output table

Note: if you need help with AWK and Linux for bioinformatics, visit my detailed AWK tutorial for bioinformatics here

awk 'BEGIN{FS=OFS="\t"}{HUMANid[$1]++; ACAid[$2]++; if(HUMANid[$1] == 1 && ACAid[$2] == 1) print }' HUMAN_63Rabs_vs_ACACA_10hits_m6.out.sorted
awk 'BEGIN{FS=OFS="\t"}{HUMANid[$1]++; ACAid[$2]++; if(HUMANid[$1] == 1 && ACAid[$2] == 1) print }' HUMAN_63Rabs_vs_ACACA_10hits_m6.out.sorted > HUMAN_63Rabs_vs_ACACA_RBHs.tsv
  • The final table of fifteen HUMAN vs. ACACA Rab-GTPase RBHs, which are likely orthologues
    sp|P62826|RAN_HUMAN L8GF16  82.524  206 36  0   11  216 8   213 7.64e-129   359
    sp|P61106|RAB14_HUMAN   L8H946  73.953  215 54  1   1   215 1   213 2.88e-117   329
    sp|P62491|RB11A_HUMAN   L8HJ43  75.829  211 51  0   3   213 2   212 9.90e-117   328
    sp|Q8WUD1|RAB2B_HUMAN   L8H2N2  70.046  217 63  2   1   216 1   216 4.48e-110   311
    sp|P20340|RAB6A_HUMAN   L8GN15  81.287  171 32  0   8   178 4   174 1.25e-104   296
    sp|P51149|RAB7A_HUMAN   L8HHF3  71.635  208 36  4   1   207 1   186 6.25e-103   292
    sp|P61018|RAB4B_HUMAN   L8HCK2  65.888  214 62  3   1   213 1   204 2.29e-102   291
    sp|Q13637|RAB32_HUMAN   L8HAC7  68.681  182 55  2   23  202 11  192 9.57e-94    270
    sp|P62820|RAB1A_HUMAN   L8HD31  69.412  170 52  0   7   176 3   172 4.80e-91    262
    sp|Q9NP72|RAB18_HUMAN   L8GYY7  58.163  196 80  2   9   203 15  209 4.10e-80    234
    sp|Q92930|RAB8B_HUMAN   L8HGB4  72.917  144 39  0   3   146 12  155 4.75e-80    233
    sp|P20339|RAB5A_HUMAN   L8H6F4  76.415  106 25  0   16  121 2   107 8.12e-59    178
    sp|Q13636|RAB31_HUMAN   L8GKG8  46.734  199 97  3   5   194 19  217 2.73e-58    178
    sp|O95755|RAB36_HUMAN   L8H339  47.926  217 101 6   82  294 14  222 3.82e-57    181
    sp|Q969Q5|RAB24_HUMAN   L8H2Q9  50.549  182 72  4   6   173 4   181 6.20e-56    172
    

10.6 EXERCISE 4: PROTEIN IDENTIFICATION/ANNOTATION using blastp

AIMS & data:

  1. run \(blastp\) to annotate a protein with its most likely function
  2. identify CDSs (genes) on anonymous DNA strands in the query file: 3cass_amplicons.fna by means of \(blastx\). To do so, we need to format a blastx-searchable DB using the protein sequences provided in the file integron_cassettes4blastdb.faa

Background info: The 3cass_amplicons.fna sequences were generated in our laboratory from diverse multidrug-resistant Enterobacteriaceae isolates recovered from polluted rivers in Morelos, Mexico. We used PCR to screen our collection of isolates for the presence of the intI integrase gene with the conserved HS458/HS459 primers, which is a hallmark of class-I integrons. Positive isolates were then subjected to PCR with the conserved CS-F & CS-R primers that amplify gene cassette arrays captured by integrons.

Figure 2. The structure of a class-I integron.

If you like, have a look at this nice video-protocol on Screening Foodstuffs for Class 1 Integrons and Gene Cassettes by Liette S. Waldron and Michael R. Gillings from the Department of Biological Sciences, Macquarie University. June 19, 2015 in J Vis Exp. 2015; (100): 52889. doi: 10.3791/52889.

10.6.1 Prepare working directories

  • extract the sequence data from the compressed tar file
# Move into the BLAST_DB_AA dir and extract the sequences in the tgz file
cd $wkdir && cd BLAST_DB_AA 

tar -xvzf gene_discovery_and_annotation_using_blastx.tgz

10.6.2 Exploring and fixing FASTA headers for database indexing

10.6.2.1 Explore the FASTA headers

# count sequences
grep -c integron_cassettes4blastdb.faa  # 2297 sequences

# explore headers
grep '^>' integron_cassettes4blastdb.faa | head -20
>gi|17026024 |[Achromobacter xylosoxidans]|AXI2|||||||Japan:Maebashi||clinical strain; synonym:Alcaligenes xylosoxidans||blaIMP-10|metallo-beta-lactamase IMP-10|741|AB074435|Iyobe_S.|Antimicrob.AgentsChemother.46_2014-2016_2002|12019129
>gi|17026026 |[Acinetobacter baumannii]|ABI1|||||||Japan:Maebashi||clinical strain||blaIMP-11|metallo-beta-lactamase IMP-11|738|AB074436|unpubl
>gi|34482004 |[Vibrio cholerae O1]||||||||||||aadA1|aminoglycoside adenyltransferase|792|AB107663|unpubl
>gi|38524487 |[Vibrio cholerae non-O1/non-O139]|||||Non O1/Non O139|||||||dfrA15|dihydrofolate reductase|474|AB113114|unpubl
>gi|51699488 |[Escherichia coli]|D49||feces|||||China: Guangzhou||patient||aadA1|aminoglycoside adenylyltransferase|792|AB189263|Su_J.|FEMSMicrobiol.Lett.254_75-80_2006|16451182
...

10.6.2.2 Fix the headers

# 1. change >gi|17026026 |[Acinetobacter baumannii]|ABI1... ==> 
#             >gnl|casseteDB|17026026 |[Acinetobacter baumannii]|ABI1...
# and 
# 2. simplify the header a bit, to avoid the many empty fields

grep '>' integron_cassettes4blastdb.faa | head -1 | sed 's#|#\n#g' | cat -n
     1  >gi
     2  17026024 
     3  [Achromobacter xylosoxidans]
     4  AXI2
     5  
     6  
     7  
     8  
     9  
    10  
    11  Japan:Maebashi
    12  
    13  clinical strain; synonym:Alcaligenes xylosoxidans
    14  
    15  blaIMP-10
    16  metallo-beta-lactamase IMP-10
    17  741
    18  AB074435
    19  Iyobe_S.
    20  Antimicrob.AgentsChemother.46_2014-2016_2002
    21  12019129
# use an AWK one-liner to do the complete job. Use the list of |-delimited fields listed above to select some of the most relevant ones
# NOTE that we add an extra field in the following transformation: >gi|17026026 => >gnl|casseteDB|17026026; 
#  therefore we need to add one to the field numbers listed above in order to get the desired ones
awk 'BEGIN{FS=OFS="|"}/>/{sub(/^>gi/, ">gnl|casseteDB"); print $1,$2,$3,$4,$16,$17,$18,$19,$22}!/>/{print $1}' integron_cassettes4blastdb.faa | grep '^>' | head -5
>gnl|casseteDB|17026024 |[Achromobacter xylosoxidans]|blaIMP-10|metallo-beta-lactamase IMP-10|741|AB074435|12019129
>gnl|casseteDB|17026026 |[Acinetobacter baumannii]|blaIMP-11|metallo-beta-lactamase IMP-11|738|AB074436|
>gnl|casseteDB|34482004 |[Vibrio cholerae O1]|aadA1|aminoglycoside adenyltransferase|792|AB107663|
>gnl|casseteDB|38524487 |[Vibrio cholerae non-O1/non-O139]|dfrA15|dihydrofolate reductase|474|AB113114|
>gnl|casseteDB|51699488 |[Escherichia coli]|aadA1|aminoglycoside adenylyltransferase|792|AB189263|16451182

# save the edited faa file as integron_cassettes4blastdb.faaED
awk 'BEGIN{FS=OFS="|"}/>/{sub(/^>gi/, ">gnl|casseteDB"); print $1,$2,$3,$4,$16,$17,$18,$19,$22}!/>/{print $1}' integron_cassettes4blastdb.faa > integron_cassettes4blastdb.faaED

# lets check the result a final time
grep '^>' integron_cassettes4blastdb.faaED | head -20

10.6.3 Formatting a protein database with makeblastdb

# 2) format DB with makeblastdb
# formatdb -i integron_cassettes4blastdb.faaED -p T -o T -n integron_cassetteDB # legacy blast
makeblastdb -in integron_cassettes4blastdb.faaED -dbtype prot -parse_seqids
Building a new DB, current time: 10/18/2022 19:04:31
New DB name:   /home/vinuesa/cursos/intro_biocomp/sesion2_blast/BLAST_DB_AA/integron_cassettes4blastdb.faaED
New DB title:  integron_cassettes4blastdb.faaED
Sequence type: Protein
Deleted existing Protein BLAST database named /home/vinuesa/cursos/intro_biocomp/sesion2_blast/BLAST_DB_AA/integron_cassettes4blastdb.faaED
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 2297 sequences in 0.490659 seconds.
# List the output generated by makeblastdb, sorted by modification time in reverse manner
ls -ltr
-rw-r--r--. 1 vinuesa faculty   4505 ago 31  2011 3cass_amplicons.fna
-rw-r--r--. 1 vinuesa faculty 924971 ago 31  2011 integron_cassettes4blastdb.faa
-rw-r--r--. 1 vinuesa faculty   1129 jul 26  2019 unknown_prot.faa
-rw-r--r--. 1 vinuesa faculty 751713 oct 10 20:15 gene_discovery_and_annotation_using_blastx.tgz
-rw-r--r--. 1 vinuesa faculty 803081 oct 18 19:03 integron_cassettes4blastdb.faaED
-rw-r--r--. 1 vinuesa faculty 566907 oct 18 19:04 integron_cassettes4blastdb.faaED.psq
-rw-r--r--. 1 vinuesa faculty   9220 oct 18 19:04 integron_cassettes4blastdb.faaED.pog
-rw-r--r--. 1 vinuesa faculty  18520 oct 18 19:04 integron_cassettes4blastdb.faaED.pin
-rw-r--r--. 1 vinuesa faculty 329297 oct 18 19:04 integron_cassettes4blastdb.faaED.phr
-rw-r--r--. 1 vinuesa faculty   9192 oct 18 19:04 integron_cassettes4blastdb.faaED.pto
-rw-r--r--. 1 vinuesa faculty  27572 oct 18 19:04 integron_cassettes4blastdb.faaED.pot
-rw-r--r--. 1 vinuesa faculty  16384 oct 18 19:04 integron_cassettes4blastdb.faaED.ptf
-rw-r--r--. 1 vinuesa faculty  63027 oct 18 19:04 integron_cassettes4blastdb.faaED.pos
-rw-r--r--. 1 vinuesa faculty 188416 oct 18 19:04 integron_cassettes4blastdb.faaED.pdb

10.6.4 Running \(blastp\) to annotate proteins

The aim of the exercise is to assign functions or annotations to the problem proteins found in file unknown_prot.faa

10.6.4.1 Explore the unknown_prot.faa file with the query sequences

# How many proteins sequences are stored in unknown_prot.faa
grep -c '^>' unknown_prot.faa

cat unknown_prot.faa

# How do we run blast? => explore the program's help output
blastp -h     # short format
blastp -help  # long  format

10.6.4.2 Runing a standard blastp search with tabular output (-outfmt 6)

# launch a standard blastp search, asking the program to provide a tabular format for the top 10 hits
blastp -query unknown_prot.faa -db integron_cassettes4blastdb.faaED -outfmt 6 -num_alignments 10

10.6.4.3 Improving the tabular output with custom fields

#>>> Lets add some interesting information to the table, missing in the standard -outfmt 6 output
# run blastp using outfmt 6 and passing a user-specified formatting string
blastp -query unknown_prot.faa -db integron_cassettes4blastdb.faaED -outfmt '6 qseqid sseqid qlen slen qstart qend sstart send length pident positive mismatch gaps evalue bitscore qcovs' -num_alignments 10 > blastp_10hits.out

# prepare a header file with the field names for easier interpretation of the blastp ouptput table
echo -e 'qseqid\tsseqid\tqlen\tslen\tqstart\tqend\tsstart\tsend\tlength\tpident\tpositive\tmismatch\tgaps\tevalue\tbitscore\tqcovs' > header1.tab
cat header1.tab

# add the header to the output table
cat header1.tab blastp_10hits.out > blastp_10hits.tsv
column -t blastp_10hits.tsv | head -20 
qseqid                                sseqid                   qlen  slen  qstart  qend  sstart  send  length  pident   positive  mismatch  gaps  evalue     bitscore  qcovs
gi|11037699|gb|AAG27703.1|AF300454_1  gnl|casseteDB|87128428   266   267   1       266   1       266   266     100.000  266       0         0     0.0        537       100
gi|11037699|gb|AAG27703.1|AF300454_1  gnl|casseteDB|145321080  266   267   1       266   1       266   266     99.624   266       1         0     0.0        536       100
gi|11037699|gb|AAG27703.1|AF300454_1  gnl|casseteDB|44844209   266   267   1       266   1       266   266     99.624   266       1         0     0.0        536       100
gi|11037699|gb|AAG27703.1|AF300454_1  gnl|casseteDB|102231612  266   267   1       266   1       266   266     99.248   266       2         0     0.0        535       100
gi|11037699|gb|AAG27703.1|AF300454_1  gnl|casseteDB|38492196   266   267   1       266   1       266   266     90.226   256       26        0     0.0        497       100
gi|11037699|gb|AAG27703.1|AF300454_1  gnl|casseteDB|118402723  266   267   1       266   1       266   266     89.850   255       27        0     0.0        494       100
gi|11037699|gb|AAG27703.1|AF300454_1  gnl|casseteDB|66277418   266   267   1       266   1       266   266     89.098   252       29        0     3.06e-179  489       100
gi|11037699|gb|AAG27703.1|AF300454_1  gnl|casseteDB|148455769  266   267   1       266   1       266   266     87.594   248       33        0     1.83e-176  483       100
gi|11037699|gb|AAG27703.1|AF300454_1  gnl|casseteDB|30014072   266   266   1       266   1       265   266     73.684   223       69        1     6.02e-146  405       100
gi|11037699|gb|AAG27703.1|AF300454_1  gnl|casseteDB|17026024   266   247   43      252   24      227   211     33.175   120       133       8     1.82e-37   128       79
gi|9294830|gb|AAF86698.1|AF180959_1   gnl|casseteDB|56691997   390   387   5       390   1       386   386     99.482   384       2         0     0.0        789       99
gi|9294830|gb|AAF86698.1|AF180959_1   gnl|casseteDB|146151072  390   383   5       388   1       382   388     45.361   251       202       10    2.45e-114  335       98
gi|9294830|gb|AAF86698.1|AF180959_1   gnl|casseteDB|40287460   390   382   13      387   5       381   379     39.314   232       224       6     7.18e-96   287       96
gi|9294830|gb|AAF86698.1|AF180959_1   gnl|casseteDB|133905118  390   410   13      387   33      409   379     39.578   231       223       6     5.97e-95   286       96
gi|9294830|gb|AAF86698.1|AF180959_1   gnl|casseteDB|76057210   390   380   1       387   1       379   396     38.131   232       219       26    4.63e-88   267       99
gi|9294830|gb|AAF86698.1|AF180959_1   gnl|casseteDB|146151112  390   276   316     380   99      165   67      22.388   34        50        2     2.4        25.8      17
gi|9294830|gb|AAF86698.1|AF180959_1   gnl|casseteDB|133905089  390   126   4       37    2       35    34      38.235   22        21        0     5.8        23.9      9
gi|9294830|gb|AAF86698.1|AF180959_1   gnl|casseteDB|161087277  390   80    133     201   15      78    79      25.316   34        34        25    6.9        23.1      18
gi|9294830|gb|AAF86698.1|AF180959_1   gnl|casseteDB|133905132  390   101   255     287   71      98    33      36.364   18        16        5     7.4        23.1      8
gi|9294830|gb|AAF86698.1|AF180959_1   gnl|casseteDB|133905029  390   337   100     154   247     287   55      21.818   24        29        14    7.8        24.3      14

10.6.5 Parsing the output table to select only high confidence annotations

As in the previous exercise on 16S rDNA sequence classification, for a confident annotation of a protein it is important to consider some basic alignment statistics.

In our particular case, since the query proteins correspond to gene cassettes encoding for antibiotic resistance, we are interested in identifying proteins down to the variant level. Therefore, we will impose very stringent filtering criteria to keep only hits with high pident and qcov.

10.6.5.1 Generate a numbered list of header fields for convenient referencing

# Print numbered list of fasta header fields to filter hits with pident >= 95% and qcovs >= 97%
sed 's/\t/\n/g' header1.tab | cat -n
     1  qseqid
     2  sseqid
     3  qlen
     4  slen
     5  qstart
     6  qend
     7  sstart
     8  send
     9  length
    10  pident
    11  positive
    12  mismatch
    13  gaps
    15  bitscore
    16  qcovs

10.6.5.2 Filter the table using AWK

# Filter out hits with >= 95% pident && qcovs >= 97%
awk 'BEGIN{FS=OFS="\t"} $10 >= 95 && $16 >= 97' blastp_10hits.tsv | column -t
qseqid                                sseqid                   qlen  slen  qstart  qend  sstart  send  length  pident   positive  mismatch  gaps  evalue  bitscore  qcovs
gi|11037699|gb|AAG27703.1|AF300454_1  gnl|casseteDB|87128428   266   267   1       266   1       266   266     100.000  266       0         0     0.0     537       100
gi|11037699|gb|AAG27703.1|AF300454_1  gnl|casseteDB|145321080  266   267   1       266   1       266   266     99.624   266       1         0     0.0     536       100
gi|11037699|gb|AAG27703.1|AF300454_1  gnl|casseteDB|44844209   266   267   1       266   1       266   266     99.624   266       1         0     0.0     536       100
gi|11037699|gb|AAG27703.1|AF300454_1  gnl|casseteDB|102231612  266   267   1       266   1       266   266     99.248   266       2         0     0.0     535       100
gi|9294830|gb|AAF86698.1|AF180959_1   gnl|casseteDB|56691997   390   387   5       390   1       386   386     99.482   384       2         0     0.0     789       99  

10.6.5.3 Add a header line to name the fields and save a filtered summary table to file

# add a subject title (stitle), to get the description line of the proteins we are finding as top hits
echo -e 'qseqid\tsseqid\tqlen\tslen\tstitle\tlength\tpident\tpositive\tmismatch\tgaps\tevalue\tbitscore\tqcovs' > header1.tab

blastp -query unknown_prot.faa -db integron_cassettes4blastdb.faaED -outfmt '6 qseqid sseqid qlen slen stitle length pident positive mismatch gaps evalue bitscore qcovs' -num_alignments 10 > blastp_10hits.out

cat header1.tab blastp_10hits.out > blastp_10hits.tsv
cut -f1,5,7,13 blastp_10hits.tsv | head -20 | column -ts $'\t'
qseqid                                stitle                                                                                                                      pident   qcovs
gi|11037699|gb|AAG27703.1|AF300454_1  |[Pseudomonas aeruginosa]|blaVIM-3|VIM-3 metallo-beta-lactamase|801|DQ342344|                                               100.000  100
gi|11037699|gb|AAG27703.1|AF300454_1  |[Pseudomonas putida]|blaVIM-6|metallo-beta-lactamase VIM-6|801|EF522838|18757180                                           99.624   100
gi|11037699|gb|AAG27703.1|AF300454_1  |[Pseudomonas aeruginosa]|blaVIM|class B beta-lactamase|801|AJ628983|                                                       99.624   100
gi|11037699|gb|AAG27703.1|AF300454_1  |[Pseudomonas aeruginosa]|blaVIM-2|metallo-beta-lactamase VIM-2|801|DQ522233|                                               99.248   100
gi|11037699|gb|AAG27703.1|AF300454_1  |[Pseudomonas aeruginosa]|blaVIM-4|metallo-beta-lactamase VIM-4|801|AY460181|                                               90.226   100
gi|11037699|gb|AAG27703.1|AF300454_1  |[Pseudomonas aeruginosa]|blaVIM-1|VIM-1 metallo-beta-lactamase|801|AJ969237|                                               89.850   100
gi|11037699|gb|AAG27703.1|AF300454_1  |[Enterobacter cloacae]|blaVIM-5|metallo-beta-lactamase VIM-5|801|DQ023222|16189133                                         89.098   100
gi|11037699|gb|AAG27703.1|AF300454_1  |[Pseudomonas aeruginosa]|blaVIM-13|class B carbapenemase VIM-13|801|EF577407|18644957                                      87.594   100
gi|11037699|gb|AAG27703.1|AF300454_1  |[Pseudomonas aeruginosa]|blaVIM-7|metallo-b-lactamase|798|AJ536835|14693560                                                73.684   100
gi|11037699|gb|AAG27703.1|AF300454_1  |[Achromobacter xylosoxidans]|blaIMP-10|metallo-beta-lactamase IMP-10|741|AB074435|12019129                                 33.175   79
gi|9294830|gb|AAF86698.1|AF180959_1   |[Klebsiella pneumoniae]|acc-1|beta-lactamase|1161|AJ870923|16982793                                                        99.482   99
gi|9294830|gb|AAF86698.1|AF180959_1   |[Klebsiella pneumoniae]||cmy-8|1149|EF382672|17526756                                                                      45.361   98
gi|9294830|gb|AAF86698.1|AF180959_1   |[Escherichia coli]|cmy-13|CMY-13|1146|AY339625|16048979                                                                    39.314   96
gi|9294830|gb|AAF86698.1|AF180959_1   |[Salmonella enterica subsp. enterica serovar Newport str. SL254]|blaCMY-2-1|CMY-2 beta-lactamase 1|1230|CP000604|17375195  39.578   96
gi|9294830|gb|AAF86698.1|AF180959_1   |[Klebsiella pneumoniae]|ampC|beta-lactamase class C DHA-1|1140|AJ971343|16436717                                           38.131   99
gi|9294830|gb|AAF86698.1|AF180959_1   |[Klebsiella pneumoniae]||sfpB|828|EF382672|17526756                                                                        22.388   17
gi|9294830|gb|AAF86698.1|AF180959_1   |[Salmonella enterica subsp. enterica serovar Newport str. SL254]||hypothetical protein|378|CP000604|17375195               38.235   9
gi|9294830|gb|AAF86698.1|AF180959_1   |[Salmonella enterica subsp. enterica serovar Choleraesuis]|orf20|Orf20|240|EU219534|                                       25.316   18
gi|9294830|gb|AAF86698.1|AF180959_1   |[Salmonella enterica subsp. enterica serovar Newport str. SL254]||conserved hypothetical protein|303|CP000604|17375195     36.364   8
  • What can we learn from the output above? Examine for example the first two hits.

10.6.5.4 Annotate the query proteins using the single best hit and generate a simplified overview table of the top-scoring results

#>>> retrieve the best hit found for each input gene
blastp -query unknown_prot.faa -db integron_cassettes4blastdb.faaED -outfmt '6 qseqid sseqid qlen slen stitle length pident positive mismatch gaps evalue bitscore qcovs' -num_alignments 1 > blastp_tophits.out

cat header1.tab blastp_tophits.out > blastp_tophits.tsv
cut -f1,5,7,13 blastp_tophits.tsv | head | column -ts $'\t'
qseqid                                stitle                                                                                        pident   qcovs
gi|11037699|gb|AAG27703.1|AF300454_1  |[Pseudomonas aeruginosa]|blaVIM-3|VIM-3 metallo-beta-lactamase|801|DQ342344|                 100.000  100
gi|9294830|gb|AAF86698.1|AF180959_1   |[Klebsiella pneumoniae]|acc-1|beta-lactamase|1161|AJ870923|16982793                          99.482   99
gi|15705413|gb|AAL05630.1|AF395881_1  |[Klebsiella pneumoniae]|bla|extended-spectrum beta-lactamase CTX-M-19|876|AF458080|12936998  52.158   95

10.6.6 Retrieve the top-scoring hits from the database using blastdbcmd and a list of database identifiers (qseqid)

We can easily retrieve the top-scoring hits from the database using blastdbcmd using a list of qseqids, as shown below.

# get the sseqids only for the top hits with % sequence identity >= 95 and query coverage >= 97%
awk 'BEGIN{FS=OFS="\t"}NR > 1 && $7 >= 95 && $13 >= 97{print $2}' blastp_tophits.tsv > sseqids_for_top_hits_with_ge95pident_ge97qcovs.list

# finally retrieve the sequences for the sseqids_for_top_hits_with_ge95pident_ge97qcovs.list from the integron_cassettes4blastdb
blastdbcmd -db integron_cassettes4blastdb.faaED -entry_batch sseqids_for_top_hits_with_ge95pident_ge97qcovs.list -dbtype prot -out top_hits_with_ge95pident_ge97qcovs.faa

cat top_hits_with_ge95pident_ge97qcovs.faa

>casseteDB:87128428 |[Pseudomonas aeruginosa]|blaVIM-3|VIM-3 metallo-beta-lactamase|801|DQ342344|
MFKLLSKLLVYLTASIMAIASPLAFSVDSSGEYPTVSEIPVGEVRLYQIADGVWSHIATKSFDGAVYPSNGLIVRDGDEL
LLIDTAWGAKNTAALLAEIEKQIGLPVTRAVSTHFHDDRVGGVDVLRAAGVATYASPSTRRLAEVEGSEIPTHSLEGLSS
SGDAVRFGPVELFYPGAAHSTDNLVVYVPSASVLYGGCAIYELSRTSAGNVADADLAEWPTSIERIQQHYPEAQFVIPGH
GLPGGLDLLKHTTNVVKAHTNRSVVE*
>casseteDB:56691997 |[Klebsiella pneumoniae]|acc-1|beta-lactamase|1161|AJ870923|16982793
MQNTLKLLSVITCLAATVQGALAANIDESKIKDTVDDLIQPLMQKNNIPGMSVAVTVNGKNYIYNYGLAAKQPQQPVTEN
TLFEVGSLSKTFAATLASYAQVSGKLSLDQSVSHYVPELRGSSFDHVSVLNVGTHTSGLQLFMPEDIKNTTQLMAYLKAW
KPADAAGTHRVYSNIGTGLLGMIAAKSLGVSYEDAIEKTLLPQLGMHHSYLKVPADQMENYAWGYNKKDEPVHVNMEILG
NEAYGIKTTSSDLLRYVQANMGQLKLDANAKMQQALTATHTGYFKSGEITQDLMWEQLPYPVSLPNLLTGNDMAMTKSVA
TPIVPPLPPQENVWINKTGSTNGFGAYIAFVPAKKMGIVMLANKNYSIDQRVTVAYKILSSLEGNK*

10.6.7 A simple strategy to conveniently generate a custom BLAST table header based on the tabular output with comments (-outfmt 7), making use of the Shell’s process substitution syntax

If you read the detailed documentation of \(blastp\) blastp -help, you will find under the output section that -outfmt 7 is a tabular with comments.

Here a few lines of code to generate a standard tabular output with a customized header, parsing the -outfmt 7 output. Note the use of cmd <(command pipeline), a syntax called process substitution, which allows to use the ouptut of \(command\ pipeline\) as if it were a file for input into \(cmd\). This is convenient, as it avoids the need of writing an intermediate file.

# 1. Run blast[p] with -outfmt 7 and the desired output fields
blastp -query unknown_prot.faa -db integron_cassettes4blastdb.faaED -outfmt '7 qseqid sseqid qlen slen stitle length pident positive mismatch gaps evalue bitscore qcovs' -num_alignments 10 > blast_m7.tmp

# 2. use process substitution to parse the -outfmt 7 output file blast_m7.tmp, and generate the standard tabular optput with labeled columns
cat <(awk '/^# Fields:/ {sub(/# Fields: /, ""); gsub(/, /, "\t"); gsub(/ /, "_"); print $0; exit}' blast_m7.tmp) <(awk '!/^# /{print $0 }' blast_m7.tmp) | less

10.6.8 HOMEWORK 2

  • Based on the discussion on the blastp output in the previous section (7.2.4 onwards), what can you conclude about the classification of the third query sequence?
  • Run a blastp search on the NCBI portal against the nr database, restricting the number of hits to 10.
    • Based on these results, how would you classify protein AAL05630.1?
    • Explore the different output formats and data download options offered by the server
    • Upload the “Hit Table (text)” returned by the NCBI blast server to moodle
    • Convert the “Descriptions Table (CSV)” to TSV, skipping the last field (with the url) and upload both files to moodle, including the code line you used for the table cleanup operation
  • Based on the tabular output field formatting options we’ve learned about in this section, simplify the blastn call from exercise 6.1.4 to generate a table with the following fields: qseqid stitle qlen slen length pident
  • Repeat the blastp analysis using protein AAL05630.1 as query to interrogate the swissprot and refseq_protein databases (with 10 hits each), which you can access from buluc or tepeu at /export/space2/data/blast/db
    • upload the corresponding results tables with the ‘qseqid stitle qlen slen length pident’ fields
    • based on the evidence of these blastp results, what would be the best annotation for AAL05630.1?

11 Translating BLAST (\(BLASTX\)) - translate a DNA sequence to search a protein database

The aim here is to identify the possible gene-cassettes encoded in the PCR amplicons collected in 3cass_amplicons.fna, as explained in the previous background info section.

11.1 EXERCISE 5: Identification of CDSs in DNA sequences using \(BLASTX\)

BLASTX is generally used to find CDSs (protein-coding sequences) in genomic DNA or transcripts. As we will see, BLASTX is a powerful gene-finding tool, but we will need to examine many alignments to ensure that all CDSs encoded on a given DNA segment are found. For large DNA strings (>5-10 kb), I would recommend to segment the string into shorter, overlapping pieces, before running BLASTX

Remember that BLASTX uses a DNA query, which is translated in its 6 frames using by default the universal code, and the resulting products are used to interrogate a protein database.

11.1.1 Running blastx with additional fields in the output table

# run blastx asking for the top 3 hits in tabular output format and an evalue cutoff set at 1e-100
#  adding relevant information to the table, missing in the standard -m 6 output, like qlen, selen, qframe & qcovs
echo -e 'qseqid\tsseqid\tqlen\tslen\tqstart\tqend\tqframe\tsstart\tsend\tlength\tpident\tpositive\tmismatch\tgaps\tevalue\tbitscore\tqcovs' > header1.tab
blastx -query 3cass_amplicons.fna -query_gencode 11 -db integron_cassettes4blastdb.faaED -evalue 1e-100 -outfmt '6 qseqid sseqid qlen slen qstart qend qframe sstart send length pident positive mismatch gaps evalue bitscore qcovs' -num_alignments 3 -out 3blastX_hits_m6_header1.tab

# if 3blastX_hits_m6_header1.tab exists and has a non-zero size, then execute the following commands ...
[ -s 3blastX_hits_m6_header1.tab ] && cat header1.tab 3blastX_hits_m6_header1.tab > ed && mv ed 3blastX_hits_m6_header1.tab

head 3blastX_hits_m6_header1.tab | column -t
qseqid  sseqid                   qlen  slen  qstart  qend  qframe  sstart  send  length  pident  positive  mismatch  gaps  evalue     bitscore  qcovs
Am1     gnl|casseteDB|146151084  1706  273   887     1699  2       2       273   272     98.897  270       2         1     0.0        536       48
Am1     gnl|casseteDB|157674384  1706  270   899     1699  2       4       270   267     99.625  266       1         0     0.0        533       47
Am1     gnl|casseteDB|161087337  1706  268   899     1699  2       2       268   267     99.625  266       1         0     0.0        533       47
Ap18    gnl|casseteDB|90265366   1762  277   39      836   3       11      276   266     95.865  258       11        0     0.0        508       45
Ap18    gnl|casseteDB|84095097   1762  285   39      836   3       19      284   266     95.489  258       12        0     3.97e-180  506       45
Ap18    gnl|casseteDB|156628012  1762  285   39      836   3       19      284   266     95.113  257       13        0     2.43e-179  504       45
Ap26    gnl|casseteDB|156628012  945   285   16      870   1       1       285   285     99.649  285       1         0     0.0        573       90
Ap26    gnl|casseteDB|133905080  945   336   1       870   1       47      336   290     97.241  285       8         0     0.0        572       92
Ap26    gnl|casseteDB|84095097   945   285   16      870   1       1       285   285     99.649  284       1         0     0.0        572       90
  • Explore in detail the results in the qlen, slen, qstart, qend, length and qcovs columns
  • What do you notice regarding the hits of Am1 and Ap18?
  • In what important aspect do the former two sequence hits differ from those retrieved for Ap26?

11.1.2 Many hits are required to get HSPs aligning over the entire DNA querey sequence and allow the identification of smaller CDSs

As we saw in the output from the previous \(blastx\) run, the hits for each query sequence align only to a certain portion of the latter. We will need to substantially increase the number of hits to have a chance to get lower-scoring HSPs that align to other regions (smaller genes) of the DNA sequence used as query.

11.1.2.1 Increasing the number of alignments to 100 and adding the stitle field to the output table

#>>> Lets try to improve the analysis by increasing the no. of hits for each sequence to 100
#    and adding the stitle column to the output table
echo -e 'qseqid\tsseqid\tstitle\tqlen\tslen\tqstart\tqend\tsstart\tsend\tlength\tpident\tpositive\tmismatch\tgaps\tevalue\tbitscore\tqcovs\tqframe' > header2.tab

#>>> print a numvered list of the positional indexes of each column
sed 's/\t/\n/g' header2.tab | nl
     1  qseqid
     2  sseqid
     3  stitle
     4  qlen
     5  slen
     6  qstart
     7  qend
     8  sstart
     9  send
    10  length
    11  pident
    12  positive
    13  mismatch
    14  gaps
    15  evalue
    16  bitscore
    17  qcovs
    18  qframe
blastx -query 3cass_amplicons.fna -query_gencode 11 -db integron_cassettes4blastdb.faaED -evalue 1e-10 -outfmt '6 qseqid sseqid stitle qlen slen qstart qend sstart send length pident positive mismatch gaps evalue bitscore qcovs qframe' -num_alignments 100 -out 100blastX_hits_m6_header2.tab

[ -s 100blastX_hits_m6_header2.tab ] && cat header2.tab 100blastX_hits_m6_header2.tab > ed && mv ed 100blastX_hits_m6_header2.tab

11.1.2.3 Generate a list of non-redundant qstart & qend positions for each query sequence

#>>> This 1liner generates a non-redundant list of the qstart qend positions for each of the query sequences. 
#    These are extracted from the full table with $(cut -f1 100blastX_hits_m6_header2.tab | grep -v qseqid | sort -u)
#    Then, for each query, grep it out of the table with grep $query 100blastX_hits_m6_header2.tab and keep only the fields 6,7: qstart, qend
#    We separte and label the results corresponding to each query sequence using the echo "#>>> $query"; as header 
#    End the corresponding rerport with echo '------------------------';
for query in $(cut -f1 100blastX_hits_m6_header2.tab | sed '1d' | sort -u)
do 
     echo "#>>> $query"
     grep $query 100blastX_hits_m6_header2.tab | cut -f6,7 | sort -u
     echo '------------------------'
done
##### OUTPUT of the for loop from the previous block
#>>> Am1
1361    1699
15  488
18  488
33  488
3   488
3   500
588 887
588 902
612 875
612 887
612 902
887 1699
899 1678
899 1699
908 1303
908 1516
908 1678
908 1684
908 1693
908 1696
908 1699
920 1699
------------------------
#>>> Ap18
1046    1762
1052    1762
1079    1528
1079    1702
1085    1738
1088    1738
1118    1738
1130    1738
1139    1738
36  836
39  812
39  836
48  173
48  443
48  656
48  797
48  812
48  830
48  836
48  860
501 836
60  836
------------------------
#>>> Ap26
1   192
16  867
16  870
1   870
532 870
58  870
61  870
70  870
79  204
79  474
79  687
79  861
79  864
79  867
79  870
91  870
------------------------
  • Focus on the Am1 query sequence: How many genes may it encode for?
# Identify the most likely gene start|end positions
grep Am1 100blastX_hits_m6_header2.tab | cut -f6,7 | sort | uniq -c | sort -nk2
      1 3   488
      4 3   500
      2 15  488
     13 18  488
      2 33  488
      1 588 887
      1 588 902
      1 612 875
      1 612 887
      6 612 902
      1 887 1699
      1 899 1678
      6 899 1699
      1 908 1303
      1 908 1516
      2 908 1696
     35 908 1699
      3 908 1693
      4 908 1684
      6 908 1678
      7 920 1699
      1 1361    1699
  • Ahaaaa! What can you learn from this output?
  • looks like if there are several genes encoded in each query sequence, particularly in the larger ones like Am1 (and Ap18, if you repeat the exercise for it)

11.1.2.4 Detailed analysis of the candidate CDSs/genes identified

#>>> Lets explore this in more detail. We'll start with the Am1 query
#    One class of hits seems to start at position 3, ending at around position 500
#    What are the hits? Lets grep them out of the table; 
#    note the use of a regular expression '3[[:space:]]500' which denotes a 3 followed by a space or tab, 
#    followed by a 500 (i.e. our qstart qend coordinates)
grep Am1 100blastX_hits_m6_header2.tab | grep '3[[:space:]]500' | cut -f1,3,6,7,11,17,18 | column -ts $'\t'

# Or somewhat more flexibly, using AWK, which will allow us to include the header in the output, as shown below
awk 'BEGIN{FS=OFS="\t"}NR==1 || (/^Am1/ && $6 == 3 && $7 == 500){print $1,$3,$6,$7,$17,$18}' Am1 100blastX_hits_m6_header2.tab | column -ts $'\t'
qseqid  stitle                                                                                       qstart  qend  qcovs  qframe
Am1     |[Serratia marcescens]|dhfrXII|dihydrofolate reductase type XII|498|AF284063|                3       500   29     3
Am1     |[Acinetobacter baumannii]|dhrfXII|dihydrofolate reductase|498|DQ995286|19365688             3       500   29     3
Am1     |[Salmonella enterica subsp. enterica serovar Choleraesuis]|dhfr|Dhfr|498|AY551331|15145503  3       500   29     3
Am1     |[Enterococcus faecalis]|dhfrXII|dihydrofolate reductase|498|AB196348|16451182               3       500   29     3

11.1.2.5 Retrieve the best hit found using blastdbcmd


#>>> Retrieve the best hit gnl|casseteDB|10444105 found by the search in the previous step
blastdbcmd -db integron_cassettes4blastdb.faaED -entry 'gnl|casseteDB|10444105' -dbtype prot -out S_marcescens_dhrXII_casseteDB-10444105.faa

#>>> Retrieve all hits found by the search in the previous step for the Am1 gene starting at pos 899 and ending at pos 1699
grep Am1 100blastX_hits_m6_header2.tab | grep '899[[:space:]]1699'
grep Am1 100blastX_hits_m6_header2.tab | grep '899[[:space:]]1699' | cut -f2 > Am1_qs899_qe1699_hit_IDs.list
blastdbcmd -db integron_cassettes4blastdb.faaED -entry_batch Am1_qs899_qe1699_hit_IDs.list -dbtype prot -out Am1_qs899_qe1699_AadA_best_hits.faa

11.1.3 Graphical exploration of standard -outfmt 6 output using the with blast-imager.pl script

As we’ve learned in the example above, we will need to explore many alignments from a BLASTX search in order to make sure that we identify all CDSs on a DNA string. The blast-imager.pl script is a very convenient tool to visualize results, when we have large lists of hits matching on different parts of our query sequence, exactly as in our case.

To run the script, we will first split the multifasta of query sequences into individual sequences, to generate graphs the the hit table of individual queries.

11.1.3.1 Copy or fetch the scripts and place them in your $HOME/bin directory


# 1. If workin on tepeu or buluc, copy required scripts to your $HOME/bin directory
[ ! -d $HOME/bin ] && mkdir $HOME/bin
cp /home/vinuesa/cursos/intro_biocomp/scripts/*.* $HOME/bin


# 2. if you don't have access to the servers @ LCG, fetch the scripts directly from the command line
wget -c https://raw.githubusercontent.com/vinuesa/TIB-filoinfo/master/blast-imager.pl
wget -c https://raw.githubusercontent.com/vinuesa/TIB-filoinfo/master/fasta_toolkit.awk
wget -c https://raw.githubusercontent.com/vinuesa/TIB-filoinfo/master/split_fasta.pl

chmod 755 blast-imager.pl fasta_toolkit.awk split_fasta.pl

11.1.3.2 Split multifasta into the individual source sequence files

# Split the multifasta into the individual source sequence files, 
# that is, generate single fasta files from the multifasta source file
split_fasta.pl 3cass_amplicons.fna
ls -ltr | tail -3
-rw-r--r--. 1 vinuesa faculty    967 oct 18 21:54 Ap26.fa
-rw-r--r--. 1 vinuesa faculty   1798 oct 18 21:54 Ap18.fa
-rw-r--r--. 1 vinuesa faculty   1740 oct 18 21:54 Am1.fa

11.1.3.3 Run \(blastx\) within a for loop piping its output to blast-imager.pl

# Run the blast-imager.pl script
# i) first writing the tables to file
#for file in *fa
#do 
#   blastx -query $file -query_gencode 11 -db integron_cassettes4blastdb.faaED -evalue 1e-20 -outfmt 6 -num_alignments 100 -out ${file%.fa}_blastxout.tab; 
#   ./blast-imager.pl ${file%.fa}_blastxout.tab > ${file%.*}.png
#   rm ${file%.fa}_blastxout.tab
#done

# ii) or directly to the graph, by piping the blastx oputput directly into the blast-imager script like so:
#for file in *fa; do blastx -query $file -query_gencode 11 -db integron_cassettes4blastdb.faaED -evalue 1e-20 -outfmt 6 -num_alignments 100 | ../../blast-imager.pl > ${file%.*}.png; done

# Run again, imposing some filtering to remove the poorest hits: pident > 98%; mismatches < 4; gaps < 3 
# which will render a better figure
for file in *fa; do blastx -query $file -query_gencode 11 -db integron_cassettes4blastdb.faaED -evalue 1e-20 -outfmt 6 -num_alignments 100 | awk '$3 > 98 && $5 < 4 && $6 < 3' | blast-imager.pl > ${file%.*}.png; done

# Have a look at the results for the three query sequences using the "eye of GNOME - eog" 
eog Am1.png &
eog Ap18.png &
eog Ap26.png &

# or all at once:
eog *.png &

Figure 3. The graphical output generated by blast-imager.pl for the 100 Am1 blastx hits

11.1.3.4 Write a table with the likely gene coordinates for Am1

The graphical display shown above in Fig. 3 and the numeric coordinates extracted previously from the hit table, strongly suggest that there are 3 gene cassettes encoded in the amplicon generated for Am1.

      1 3   488
      4 3   500
      2 15  488
     13 18  488
      2 33  488
      1 588 887
      1 588 902
      1 612 875
      1 612 887
      6 612 902
      1 887 1699
      1 899 1678
      6 899 1699
      1 908 1303
      1 908 1516
      2 908 1696
     35 908 1699
      3 908 1693
      4 908 1684
      6 908 1678
      7 920 1699
      1 1361    1699

Looking closely at the coordinates and figure 3, we can write the following table with the likely gene coordinates

Table 1. Coordinates for gene cassettes encoded in the Am1 amplicon DNA sequence.

start end strand note
3 500 + -
612 902 + The 887 and 899 starts would overlap with the first gene; this is not typical for gene cassettes
908 1699 + -

11.1.3.5 Extract the individual cassette genes with \(fasta\_toolkit.awk\) and check for possible start and end condons

We will use the fasta_toolkit.awk script to extract the three gene cassettes from the corresponding DNA string using the their coordinates, as defined in Table 1.

  • 1st gene
fasta_toolkit.awk -R 3 -s 3 -e 500 Am1.fa
>Am1
ATGAACTCGGAATCAGTACGCATTTATCTCGTTGCTGCGATGGGAGCCAATCGGGTTATTGGCAATGGTCCTAATATCCCCTGGAAAATTCCGGGTGAGCAGAAGATTTTTCGCAGACTCACTGAGGGAAAAGTCGTTGTCATGGGGCGAAAGACCTTTGAGTCTATCGGCAAGCCTCTACCGAACCGTCACACATTGGTAATCTCACGCCAAGCTAACTACCGCGCCACTGGCTGCGTAGTTGTTTCAACGCTGTCGCACGCTATCGCTTTGGCATCCGAACTCGGCAATGAACTCTACGTCGCGGGCGGAGCTGAGATATACACTCTGGCACTACCTCACGCCCACGGCGTGTTTCTATCTGAGGTACATCAAACCTTCGAGGGTGACGCCTTCTTCCCAATGCTCAACGAAACAGAATTCGAGCTTGTCTCAACCGAAACCATTCAAGCTGTAATTCCGTACACCCACTCCGTTTATGCGCGTCGAAACGGCTAA
  • Lets save the three CDSs to file
# write first CDS to Am1_cassette_CDSs.fna, and append with 2cnd and 3rd to it using >>
fasta_toolkit.awk -R 3 -s 3 -e 500 Am1.fa > Am1_cassette_CDSs.fna
fasta_toolkit.awk -R 3 -s 612 -e 902 Am1.fa >> Am1_cassette_CDSs.fna
fasta_toolkit.awk -R 3 -s 908 -e 1699 Am1.fa >> Am1_cassette_CDSs.fna
  • We need to add consecutive ORF/gene names to distinguish them from one another
grep '^>' Am1_cassette_CDSs.fna
>Am1
>Am1
>Am1

perl -pe 'if(/^>(\S+)/){ $c++; $h=$1 . "_ORF" . $c; s/$1/$h/ }' Am1_cassette_CDSs.fna
perl -pe 'if(/^>(\S+)/){ $c++; $h=$1 . "_ORF" . $c; s/$1/$h/ }' Am1_cassette_CDSs.fna > ed && mv ed Am1_cassette_CDSs.fna

cat Am1_cassette_CDSs.fna
>Am1_ORF1
ATGAACTCGGAATCAGTACGCATTTATCTCGTTGCTGCGATGGGAGCCAATCGGGTTATTGGCAATGGTCCTAATATCCCCTGGAAAATTCCGGGTGAGCAGAAGATTTTTCGCAGACTCACTGAGGGAAAAGTCGTTGTCATGGGGCGAAAGACCTTTGAGTCTATCGGCAAGCCTCTACCGAACCGTCACACATTGGTAATCTCACGCCAAGCTAACTACCGCGCCACTGGCTGCGTAGTTGTTTCAACGCTGTCGCACGCTATCGCTTTGGCATCCGAACTCGGCAATGAACTCTACGTCGCGGGCGGAGCTGAGATATACACTCTGGCACTACCTCACGCCCACGGCGTGTTTCTATCTGAGGTACATCAAACCTTCGAGGGTGACGCCTTCTTCCCAATGCTCAACGAAACAGAATTCGAGCTTGTCTCAACCGAAACCATTCAAGCTGTAATTCCGTACACCCACTCCGTTTATGCGCGTCGAAACGGCTAA
>Am1_ORF2
ATGTTTATTCAAACGGCATTTAGCTTTTCAGGCGTTATTCAGTGCCTGTTTTGCCTTTTTTCCGGGCTTCGCCTGCATGGGCTGCGCAGGTTTTCAGTCTTTTTGGCCTCTAGCCCTTGCGTAGCAAGCGCAAGCAGCTATCGTTTTTGCAGTGCTGTGCCGCCTCGGTGGCGCAGCGTTTTTTCACGGTTAGCGCCCGTCGCCAAATTCAAGTTATCCGTTTTGGCTTCTGGGTCTAACATTTCGGTCAAGCCGACCGCCATTCTGCGGTCGGCTTACCTCGCCCGTTAG
>Am1_ORF3
ATGAGGGAAGCGGTGACCATCGAAATTTCGAACCAACTATCAGAGGTGCTAAGCGTCATTGAGCGCCATCTGGAATCAACGTTGCTGGCCGTGCATTTGTACGGCTCCGCAGTGGATGGCGGCCTGAAGCCATACAGCGATATTGATTTGTTGGTTACTGTGGCCGTAAAGCTTGATGAAACGACGCGGCGAGCATTGCTCAATGACCTTATGGAGGCTTCGGCTTTCCCTGGCGAGAGCGAGACGCTCCGCGCTATAGAAGTCACCCTTGTCGTGCATGACGACATCATCCCGTGGCGTTATCCGGCTAAGCGCGAGCTGCAATTTGGAGAATGGCAGCGCAATGACATTCTTGCGGGTATCTTCGAGCCAGCCATGATCGACATTGATCTAGCTATCCTGCTTACAAAAGCAAGAGAACATAGCGTTGCCTTGGTAGGTCCGGCAGCGGAGGAATTCTTTGACCCGGTTCCTGAACAGGATCTATTCGAGGCGCTGAGGGAAACCTTGAAGCTATGGAACTCGCAGCCCGACTGGGCCGGCGATGAGCGAAATGTAGTGCTTACGTTGTCCCGCATTTGGTACAGCGCAATAACCGGCAAAATCGCGCCGAAGGATGTCGCTGCCGACTGGGCAATAAAACGCCTACCTGCCCAGTATCAGCCCGTCTTACTTGAAGCTAAGCAAGCTTATCTGGGACAAAAAGAAGATCACTTGGCCTCACGCGCAGATCACTTGGAAGAATTTATTCGCTTTGTGAAAGGCGAGATCATCAAGTCAGTTGGTAAATGA

11.1.3.6 Translate the CDSs to their conceptual translation products using the universal genetic code

This operation can be easily performed with the help of fasta_toolkit.awk

fasta_toolkit.awk -R 4 Am1_cassette_CDSs.fna > Am1_cassette_CDSs_translated.faa
  • Visualize the resulting file
cat Am1_cassette_CDSs_translated.faa
>Am1_ORF1
MNSESVRIYLVAAMGANRVIGNGPNIPWKIPGEQKIFRRLTEGKVVVMGRKTFESIGKPLPNRHTLVISRQANYRATGCVVVSTLSHAIALASELGNELYVAGGAEIYTLALPHAHGVFLSEVHQTFEGDAFFPMLNETEFELVSTETIQAVIPYTHSVYARRNG*
>Am1_ORF2
MFIQTAFSFSGVIQCLFCLFSGLRLHGLRRFSVFLASSPCVASASSYRFCSAVPPRWRSVFSRLAPVAKFKLSVLASGSNISVKPTAILRSAYLAR*
>Am1_ORF3
MREAVTIEISNQLSEVLSVIERHLESTLLAVHLYGSAVDGGLKPYSDIDLLVTVAVKLDETTRRALLNDLMEASAFPGESETLRAIEVTLVVHDDIIPWRYPAKRELQFGEWQRNDILAGIFEPAMIDIDLAILLTKAREHSVALVGPAAEEFFDPVPEQDLFEALRETLKLWNSQPDWAGDERNVVLTLSRIWYSAITGKIAPKDVAADWAIKRLPAQYQPVLLEAKQAYLGQKEDHLASRADHLEEFIRFVKGEIIKSVGK*

11.2 Assign functional annotations to the 3 CDSs encoded in the Am1 class-I integron gene cassettes

We will run BLASTP a final time to get the best functional annotations for the three ORFs (gene cassettes) identified via BLASTX in the Am1 amplicon.

# 1. run blastp with the translated products and get 10 top hits
echo -e 'qseqid\tsseqid\tqlen\tslen\tstitle\tlength\tpident\tpositive\tmismatch\tgaps\tevalue\tbitscore\tqcovs' > header1.tab
blastp -query Am1_cassette_CDSs_translated.faa -db integron_cassettes4blastdb.faaED -outfmt '6 qseqid sseqid qlen slen stitle length pident positive mismatch gaps evalue bitscore qcovs' -num_alignments 10 > blastp_10hits.out

cat header1.tab blastp_10hits.out > ed && mv ed blastp_10hits.out

# 2. reduce the output table to relevant fields for easier visualization
sed 's/\t/\n/g' header1.tab | cat -n
     1  qseqid
     2  sseqid
     3  qlen
     4  slen
     5  stitle
     6  length
     7  pident
     8  positive
     9  mismatch
    10  gaps
    11  evalue
    12  bitscore
    13  qcovs
awk 'BEGIN{FS="\t"; OFS=","; print "qseqid,qlen,slen,stitle,pident,bitscore,qcovs"}{print $1,$3,$4,$5,$7,$12,$13}' blastp_10hits.out > Am1_top10_blastp_hits_for_3ORFs.csv
awk 'BEGIN{FS=OFS="\t"; print "qseqid\tqlen\tslen\tstitle\tpident\tbitscore\tqcovs"}{print $1,$3,$4,$5,$7,$12,$13}' blastp_10hits.out > Am1_top10_blastp_hits_for_3ORFs.tsv
qseqid,qlen,slen,stitle,pident,bitscore,qcovs
Am1_ORF1,166,166,|[Serratia marcescens]dhfrXII|dihydrofolate reductase type XII|498|AF284063|unpubl,100.000,340,100
Am1_ORF1,166,166,|[Acinetobacter baumannii]|ZJB04|||dhrfXII|dihydrofolate reductase|498|DQ995286|Xu_H.|Curr.Microbiol.59_113-117_2009|19365688,99.398,338,100
Am1_ORF1,166,166,|[Salmonella enterica subsp. enterica serovar Choleraesuis]dhfr|Dhfr|498|AY551331|Huang_T.M.|Vet.Microbiol.100_247-254_2004|15145503,99.398,338,100
Am1_ORF1,166,166,|[Enterococcus faecalis]|||human male bone|China:GuangzhoudhfrXII|dihydrofolate reductase|498|AB196348|Su_J.|FEMSMicrobiol.Lett.254_75-80_2006|16451182,98.795,337,100
Am1_ORF1,166,178,|[Salmonella enterica subsp. enterica serovar Choleraesuis]|OU7519|hybrid of pSCV50 and pSC138||tri|Tri|534|EU219534|unpubl,100.000,333,98
Am1_ORF1,166,158,|[Escherichia coli]||Rs411b||dhfr17|dihydrofolate reductase|474|DQ875874|Ozugumus_O.B.|J.Microbiol.45_379-387_2007|17978796,36.709,108,95
Am1_ORF1,166,163,|[Escherichia coli]dfr17|dihydrofolate reductase|491|AY828551|Zhang_H.|Microbiol.Immunol.48_639-645_2004|15383699,36.709,108,95
Am1_ORF1,166,158,|[Acinetobacter baumannii]||607460||dfrVII|dihydrofolate reductase|474|EU340417|Xu_X.|Int.J.Antimicrob.Agents32_441-445_2008|18757181,38.608,107,95
Am1_ORF1,166,183,|[Escherichia coli]DfrA17||549|AM886293|unpubl,36.709,108,95
Am1_ORF1,166,158,|[Escherichia coli]||clinical isolate||dhfrIb|dihydrofolate reductase type Ib|474|Z50805|unpubl,37.342,105,95
Am1_ORF2,97,97,|[Serratia marcescens]|unknown|291|AF284063|unpubl,98.969,183,100
Am1_ORF2,97,97,|[Klebsiella pneumoniae]|KpS15unknown|291|AF180731|unpubl,97.938,183,100
Am1_ORF2,97,97,|[Enterococcus faecalis]|||human male bone|China:Guangzhou|putative protein|291|AB196348|Su_J.|FEMSMicrobiol.Lett.254_75-80_2006|16451182,97.938,181,100
Am1_ORF2,97,97,|[uncultured bacterium]|unknown|291|AY115474|Tennstedt_T.|FEMSMicrobiol.Ecol.45_239-252_2003|19719593,94.845,176,100
Am1_ORF2,97,97,|[uncultured bacterium]|||arable soil||hypothetical protein|291|DQ779005|Heuer_H|Environ.Microbiol.9_657-666_2007|17298366,78.351,137,100
Am1_ORF2,97,105,|[Aeromonas salmonicida subsp. salmonicida]|123-86 3101unknown|315|AF327731|L'Abee-Lund_T.M|Microb.DrugResist.7_263-272_2001|11759088,78.351,137,100
Am1_ORF2,97,101,|[Pseudomonas aeruginosa]|orfD|303|AY460181|unpubl,77.174,129,95
Am1_ORF2,97,93,|[Pseudomonas aeruginosa]||4677||MexicoorfD|hypothetical protein|279|EF184217|unpubl,76.087,127,95
Am1_ORF2,97,88,|[Salmonella enterica subsp. enterica serovar Enteritidis]||Homo sapiens||Brazil|unknown|266|DQ278190|Peirano_G.|J.Antimicrob.Chemother.58_305-309_2006|16782743,76.136,122,91
Am1_ORF2,97,97,|[Salmonella enterica subsp. enterica serovar Agona]||Homo sapiens||Brazil|unknown|291|DQ278189|Peirano_G.|J.Antimicrob.Chemother.58_305-309_2006|16782743,76.289,115,100
Am1_ORF3,264,273,|[Klebsiella pneumoniae]|NK29aadA2|819|EF382672|Chen_Y.T.|Antimicrob.AgentsChemother.51_3004-3007_2007|17526756,100.000,531,100
Am1_ORF3,264,264,|[Serratia marcescens]aadA2|streptomycin/spectinomycin adenyltransferase|792|AF284063|unpubl,100.000,530,100
Am1_ORF3,264,264,|[uncultured bacterium]aadA2|streptomycin/spectinomycin adenylyltransferase|792|AY115474|Tennstedt_T.|FEMSMicrobiol.Ecol.45_239-252_2003|19719593,99.621,529,100
Am1_ORF3,264,264,|[Aeromonas hydrophila]|CVM861||diarrheic pig|aada2|aminoglycoside adenyltransferase|792|AY522923|Poole_T.L.|J.Antimicrob.Chemother.57_31-38_2006|16339607,99.621,529,100
Am1_ORF3,264,268,|[Salmonella enterica subsp. enterica serovar Choleraesuis]|OU7519|hybrid of pSCV50 and pSC138||aadA2|AadA2|804|EU219534|unpubl,99.621,528,100
Am1_ORF3,264,270,|[Proteus mirabilis]|C04014|||aadA2|AadA2|810|EU006711|Boyd_D.A.|Antimicrob.AgentsChemother.52_340-344_2008|18025121,99.621,528,100
Am1_ORF3,264,264,|[Proteus mirabilis]aadA2|aminoglycoside adenyltransferase|792|AY681136|unpubl,99.621,528,100
Am1_ORF3,264,264,|[uncultured bacterium]|||arable soil|aadA2|aminoglycoside-3'-adenylyltransferase|792|DQ779004|Heuer_H|Environ.Microbiol.9_657-666_2007|17298366,99.621,527,100
Am1_ORF3,264,264,|[Salmonella typhimurium DT104]|H3380|Salmonella typhimurium phage type DT104||aadA2|streptomycin/spectinomycin adenyltransferase|792|AF071555|Briggs_C.E|Antimicrob.AgentsChemother.43_846-849_1999|10103189,99.242,526,100
Am1_ORF3,264,264,|[Escherichia coli]|W21-3||sewage of swine farm|China: GuangzhouaadA2|streptomycin/spectinomycin 3' adenyltransferase|792|DQ157750|unpubl,98.864,522,100

Table 2 Annotation of the gene-cassettes found in the Am1 amplicon

ORF start end strand gene annotation
1 3 500 + dhfrXII dihydrofolate reductase type XII
2 612 902 + orfD hypothetical protein
3 908 1699 + aadA2 aminoglycoside adenyltransferase type II

Figure 4. Prevalence of class-1 integrons and distribution of identified gene cassette arrays among Enterobacteriaceae isolates recovered from clean and contaminated rivers in Morelos, Mexico. Results from Jazmín Ramos Madrigal’s bachelor thesis @lcgunam.

11.2.1 HOMEWORK 3

Repeat the BLASTX based identification of gene cassettes for the Ap18 and Ap26 amplicons, extract the corresponding CDSs, translate them and annotate them with BLASTP, writing a table with the single-best annotation for each ORF.